Conversation

@michaelbautin (Collaborator) commented Sep 26, 2025

Original patch was created by @PingXie (reviewed by @yairgott). These changes originated in the valkey-search project.

Original commit message below:

This change improves the memory allocation strategy of the HNSW index data structures by adopting chunked arrays. Previously, resizing the HNSW index involved expensive realloc calls, which were particularly costly in two ways:

  • Memory Consumption: During resizing, realloc temporarily requires roughly double the memory, because the old block must stay allocated until the new block is ready and the data has been copied. With gigabytes of index data, this led to excessive peak memory usage.
  • Performance Overhead: realloc copies the data from the old block to the new block, an expensive memcpy that grows more costly as the data size grows, causing significant performance degradation for large-scale indices.
By transitioning to chunked arrays for memory allocation, we circumvent the need for realloc calls during resizing. This approach not only reduces the memory footprint by eliminating the temporary doubling of memory but also enhances overall performance by avoiding the costly memcpy operations. The chunked array strategy allocates new memory in smaller, manageable chunks as the data grows, ensuring more efficient use of memory and improving the scalability of HNSW index operations, especially critical for indices spanning multiple gigabytes.
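
For illustration only, here is a minimal sketch of the chunked-array idea; the class name, members, and layout are simplified and hypothetical rather than the exact ChunkedArray introduced by this PR. Elements live in fixed-size chunks, so growing the structure appends chunks instead of reallocating and copying one large contiguous block.

// Hypothetical, simplified sketch of a chunked array; not the actual ChunkedArray class.
#include <cstddef>
#include <deque>
#include <memory>

class SimpleChunkedArray {
 public:
    SimpleChunkedArray(size_t element_size, size_t elements_per_chunk)
        : element_size_(element_size), elements_per_chunk_(elements_per_chunk) {}

    // Growing only appends new chunks; existing elements are never moved,
    // so there is no transient double allocation and no large memcpy.
    void resize(size_t num_elements) {
        size_t needed_chunks =
            (num_elements + elements_per_chunk_ - 1) / elements_per_chunk_;
        while (chunks_.size() < needed_chunks) {
            chunks_.emplace_back(new char[element_size_ * elements_per_chunk_]);
        }
    }

    // An element index is translated to (chunk, offset within chunk).
    char* operator[](size_t i) const {
        return chunks_[i / elements_per_chunk_].get() +
               (i % elements_per_chunk_) * element_size_;
    }

 private:
    size_t element_size_;
    size_t elements_per_chunk_;
    std::deque<std::unique_ptr<char[]>> chunks_;
};

The trade-off is that element access needs a division and modulo (or a shift and mask when the chunk size is a power of two) instead of plain pointer arithmetic into one contiguous block.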

@michaelbautin changed the title from "Chunked memory allocation" to "Optimize HNSW Memory Allocation with Chunked Arrays" on Sep 29, 2025

@PingXie left a comment

LGTM overall. Thanks for upstreaming this change, Michael!

@yairgott commented Oct 2, 2025

LGTM!

In case not already done, it would be useful to run a sanity test with an ASAN build.

@michaelbautin (Collaborator, Author)

@yairgott: ASAN tests are already being run in GitHub Actions.

@michaelbautin (Collaborator, Author) commented Oct 29, 2025

Here are some performance test results.

Used ann-benchmarks at commit f402b2cc17b980d7cd45241ab5a7a4cc0f965e55 with the following patch applied: https://gist.githubusercontent.com/michaelbautin/fe38704f9678532ee42d0788b0fc9ccb/raw

Used the following script to alternate between unmodified and modified versions of hnswlib: https://gist.githubusercontent.com/michaelbautin/b7b6264f899e835d2da06d641c1e05f8/raw

The modified version of hnswlib (~/code/hnswlib) was at commit https://github.com/michaelbautin/hnswlib/commits/c32cc21ee5ef0797c9888bcc6519eba6304bcb4f and the unmodified/baseline version (~/code/hnswlib-baseline) was at commit https://github.com/nmslib/hnswlib/commits/2fba7fba961fffcbfb6578d3d2979419edf1e0e7.

Tested on a 96-vCPU GCE VM with an Intel(R) Xeon(R) Platinum 8581C CPU @ 2.30GHz.

The index build times did not change significantly and were in the range of 421-424 seconds for both versions of hnswlib on the deep-image-96-angular ann-benchmarks dataset.

According to a two-sided t-test computed with the T.TEST(<range1>,<range2>,2,3) function in Google Sheets, there is no statistically significant difference in index build times between the two versions (20 runs each, with and without the changes).

[image: spreadsheet of index build time measurements for the 20 runs]

Tests using speedtest.py, as requested by @yurymalkov:

Baseline (2fba7fb):

$ python3 tests/python/speedtest.py -d 4 -t 1
0.025458335876464844 seconds, recall= 1.0
0.024799585342407227 seconds, recall= 1.0
0.0272064208984375 seconds, recall= 1.0
0.025821447372436523, 0.025458335876464844, 0.0010155792492704126, 0.6601953506469727, 1.0, None
$ python3 tests/python/speedtest.py -d 4 -t 64
0.3272378444671631 seconds, recall= 1.0
0.3278830051422119 seconds, recall= 1.0
0.3281400203704834 seconds, recall= 1.0
0.32775362332661945, 0.3278830051422119, 0.0003795041332998049, 0.6465816497802734, 1.0, None

With the changes:

$ python3  tests/python/speedtest.py -d 4 -t 1
0.03078746795654297 seconds, recall= 1.0
0.029574155807495117 seconds, recall= 1.0
0.02940821647644043 seconds, recall= 1.0
0.029923280080159504, 0.029574155807495117, 0.0006148167638485685, 0.6504831314086914, 1.0, None
$ python3 tests/python/speedtest.py -d 4 -t 64
0.3246622085571289 seconds, recall= 1.0
0.32256174087524414 seconds, recall= 1.0
0.32404065132141113 seconds, recall= 1.0
0.32375486691792804, 0.32404065132141113, 0.0008810015757141497, 0.6430518627166748, 1.0, None

Copilot AI left a comment

Pull request overview

This PR optimizes HNSW index memory allocation by replacing expensive realloc calls with a chunked array strategy. The changes significantly reduce memory consumption (eliminating temporary memory doubling during resize) and improve performance (avoiding costly memcpy operations) for large-scale indices.

Key Changes:

  • Introduced ChunkedArray template class for managing large array-like data with fixed-size memory chunks
  • Converted data_level0_memory_ and linkLists_ from raw pointers to ChunkedArray instances
  • Replaced conditional prefetch macros with a unified HNSWLIB_MM_PREFETCH macro with bounds checking (a sketch of such a macro follows this list)
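
A hedged sketch of what a unified prefetch macro could look like; this is not the exact definition in the PR. USE_SSE/USE_AVX/USE_AVX512 are the conditional macros hnswlib already uses, and the bounds checks mentioned above are applied at the call sites rather than inside the macro.

#if defined(USE_SSE) || defined(USE_AVX) || defined(USE_AVX512)
#include <immintrin.h>
// Forward to the hardware prefetch intrinsic when SIMD support is compiled in.
#define HNSWLIB_MM_PREFETCH(addr, hint) _mm_prefetch((const char *)(addr), (hint))
#else
// Compile to a no-op on builds without prefetch support.
#define HNSWLIB_MM_PREFETCH(addr, hint) ((void)0)
#endif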

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

  • hnswlib/hnswlib.h: Adds the ChunkedArray template class and the HNSWLIB_MM_PREFETCH macro for chunked memory management
  • hnswlib/hnswalg.h: Converts HNSW data structures to use ChunkedArray and updates prefetch calls with bounds checking
  • python_bindings/bindings.cpp: Updates the Python bindings to use the new copyTo/copyFrom methods and getLinkListPtr/setLinkListPtr accessors
  • CMakeLists.txt: Removes a blank line (formatting only)


#include <assert.h>

#include <memory>
#include <type_traits>

Copilot AI Dec 7, 2025

The ChunkedArray class uses std::deque on line 556, but the header <deque> is not included. This will cause a compilation error. Add #include <deque> to the includes section near the top of the file.

Suggested change
- #include <type_traits>
+ #include <type_traits>
+ #include <deque>

HNSWLIB_MM_PREFETCH((char *) (visited_array + *(data + 1)), _MM_HINT_T0);
HNSWLIB_MM_PREFETCH((char *) (visited_array + *(data + 1) + kCacheLineSize), _MM_HINT_T0);
HNSWLIB_MM_PREFETCH(getDataByInternalId(*datal), _MM_HINT_T0);
HNSWLIB_MM_PREFETCH(getDataByInternalId(*(datal + 1)), _MM_HINT_T0);

Copilot AI Dec 7, 2025

Prefetching *(datal + 1) without checking if size >= 2 could lead to out-of-bounds memory access. If size is 0 or 1, this prefetch will read beyond the valid data. Consider adding a bounds check: if (size >= 2) { HNSWLIB_MM_PREFETCH(...) }.

Suggested change
- HNSWLIB_MM_PREFETCH(getDataByInternalId(*(datal + 1)), _MM_HINT_T0);
+ if (size >= 2) {
+     HNSWLIB_MM_PREFETCH(getDataByInternalId(*(datal + 1)), _MM_HINT_T0);
+ }

size_t cur_size = std::min(bytes_per_chunk, num_bytes);
memcpy(destination, chunks_[chunk_index].get(), cur_size);
num_bytes -= cur_size;
destination += cur_size;

Copilot AI Dec 7, 2025

The copyTo method doesn't increment chunk_index in the loop, which means it will always copy from chunk 0 repeatedly. This will result in incorrect data being copied. The line should increment chunk_index after each iteration: chunk_index++.

Suggested change
- destination += cur_size;
+ destination += cur_size;
+ ++chunk_index;

Comment on lines +526 to +535
void copyFrom(const char* source, size_t num_bytes) {
size_t chunk_index = 0;
size_t bytes_per_chunk = getSizePerChunk();
while (num_bytes > 0) {
size_t cur_size = std::min(bytes_per_chunk, num_bytes);
memcpy(chunks_[chunk_index].get(), source, cur_size);
num_bytes -= cur_size;
source += cur_size;
}
}

Copilot AI Dec 7, 2025

The copyFrom method doesn't increment chunk_index in the loop, which means it will always copy to chunk 0 repeatedly. This will result in data corruption. The line should increment chunk_index after each iteration: chunk_index++.

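Applying the same fix suggested for copyTo above, the corrected copyFrom would advance chunk_index on each iteration; everything else stays as in the quoted code:

void copyFrom(const char* source, size_t num_bytes) {
    size_t chunk_index = 0;
    size_t bytes_per_chunk = getSizePerChunk();
    while (num_bytes > 0) {
        size_t cur_size = std::min(bytes_per_chunk, num_bytes);
        memcpy(chunks_[chunk_index].get(), source, cur_size);
        num_bytes -= cur_size;
        source += cur_size;
        ++chunk_index;  // advance to the next chunk (the missing statement)
    }
}
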
Comment on lines 517 to 519
linklistsizeint *get_linklist0(tableint internal_id, char *data_level0_memory_) const {
-     return (linklistsizeint *) (data_level0_memory_ + internal_id * size_data_per_element_ + offsetLevel0_);
+     return (linklistsizeint *) (data_level0_memory_[internal_id] + offsetLevel0_);
}

Copilot AI Dec 7, 2025

The parameter data_level0_memory_ shadows the member variable with the same name but is never used in the function body. The function directly accesses data_level0_memory_[internal_id] which refers to the member variable. This unused parameter should either be removed or the function should use it instead of the member variable.

ChunkedArray(const ChunkedArray& other) = delete;
ChunkedArray& operator=(const ChunkedArray& other) = delete;

ChunkedArray(ChunkedArray&& other) noexcept {

Copilot AI Dec 7, 2025

The move constructor doesn't initialize member variables before calling swap. This means it's swapping with uninitialized values. The member variables should be initialized (e.g., using the default constructor's member initializer list) before calling swap(other), or the swap logic should be replaced with direct member moves.

Suggested change
- ChunkedArray(ChunkedArray&& other) noexcept {
+ ChunkedArray(ChunkedArray&& other) noexcept
+     : element_byte_size_(0),
+       elements_per_chunk_(0),
+       element_count_(0),
+       chunk_padding_bytes_(0) {

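Combining the suggestion with the existing body (which, per the comment, calls swap(other)), the fixed move constructor would look roughly as follows; the swap() helper itself is not shown in this excerpt and is assumed to exchange all members, including the chunk storage:

ChunkedArray(ChunkedArray&& other) noexcept
    : element_byte_size_(0),
      elements_per_chunk_(0),
      element_count_(0),
      chunk_padding_bytes_(0) {
    swap(other);  // leaves 'other' holding the default-initialized state
}
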
Comment on lines +376 to +379
MallocUniqueCharArrayPtr makeUniqueCharArray(size_t n_bytes) {
char* raw_ptr = static_cast<char*>(malloc(n_bytes));
return MallocUniqueCharArrayPtr(raw_ptr);
}

Copilot AI Dec 7, 2025

The makeUniqueCharArray function doesn't check if malloc returns nullptr before wrapping it in a unique_ptr. If memory allocation fails, this will create a unique_ptr containing nullptr, which could lead to crashes when dereferenced. Consider checking for allocation failure and throwing an exception or returning an error status.

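One hedged way to address this, sketched from the quoted function; the choice of std::runtime_error is an assumption (it requires <stdexcept> and <string>), and any exception type or error status the project prefers would work equally well:

MallocUniqueCharArrayPtr makeUniqueCharArray(size_t n_bytes) {
    char* raw_ptr = static_cast<char*>(malloc(n_bytes));
    if (raw_ptr == nullptr) {
        // Fail loudly instead of returning a null-owning unique_ptr.
        throw std::runtime_error(
            "Failed to allocate " + std::to_string(n_bytes) + " bytes");
    }
    return MallocUniqueCharArrayPtr(raw_ptr);
}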